ExtraaLearn Project¶

Context¶

The EdTech industry has surged immensely over the past decade; according to one forecast, the online education market would be worth $286.62bn by 2023, growing at a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has driven this growth and expansion beyond any previous limit. With dominant features like ease of information sharing, personalized learning experiences, and transparency of assessment, it is now preferable to traditional education for many learners.

In the present scenario, due to Covid-19, the online education sector has witnessed rapid growth and is attracting many new customers. This rapid growth has drawn many new companies into the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. Customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, such as:

  • The customer interacts with the marketing front on social media or other online platforms.
  • The customer browses the website/app and downloads the brochure.
  • The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. For this, a representative from the organization connects with the lead by call or email to share further details.

Objective¶

ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:

  • Analyze and build an ML model to help identify which leads are more likely to convert to paid customers.
  • Find the factors driving the lead conversion process.
  • Create a profile of the leads which are likely to convert.

Data Description¶

The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.

Data Dictionary

  • ID: ID of the lead

  • age: Age of the lead

  • current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'

  • first_interaction: How the lead first interacted with ExtraaLearn. Values include 'Website' and 'Mobile App'

  • profile_completed: What percentage of the profile the lead has filled on the website/mobile app. Values: Low (0-50%), Medium (50-75%), High (75-100%)

  • website_visits: How many times has a lead visited the website

  • time_spent_on_website: Total time spent on the website

  • page_views_per_visit: Average number of pages on the website viewed during the visits.

  • last_activity: Last interaction between the lead and ExtraaLearn.

    • Email Activity: Sought details about the program through email; a representative shared information such as the program brochure, etc.
    • Phone Activity: Had a phone conversation or an SMS conversation with a representative, etc.
    • Website Activity: Interacted with a representative on live chat, updated the profile on the website, etc.
  • print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper.

  • print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine.

  • digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms.

  • educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.

  • referral: Flag indicating whether the lead had heard about ExtraaLearn through reference.

  • status: Flag indicating whether the lead was converted to a paid customer or not.

Importing necessary libraries and data¶

In [1]:
import warnings

warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.simplefilter("ignore", ConvergenceWarning)

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler

# To tune different models
from sklearn.model_selection import GridSearchCV


# To get different metric scores
import sklearn.metrics as metrics

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer
)
In [2]:
# Mount Gdrive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [3]:
# Read the dataset

data = pd.read_csv("/content/drive/MyDrive/Python Course/ExtraaLearn.csv")

Data Overview¶

  • Observations
  • Sanity checks
In [4]:
data.shape
Out[4]:
(4612, 15)
In [5]:
data.head()
Out[5]:
ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
0 EXT001 57 Unemployed Website High 7 1639 1.86100 Website Activity Yes No Yes No No 1
1 EXT002 56 Professional Mobile App Medium 2 83 0.32000 Website Activity No No No Yes No 0
2 EXT003 52 Professional Website Medium 3 330 0.07400 Website Activity No No Yes No No 0
3 EXT004 53 Unemployed Website High 4 464 2.05700 Website Activity No No No No No 1
4 EXT005 23 Student Website High 4 600 16.91400 Email Activity No No No No No 0
In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4612 non-null   object 
 1   age                    4612 non-null   int64  
 2   current_occupation     4612 non-null   object 
 3   first_interaction      4612 non-null   object 
 4   profile_completed      4612 non-null   object 
 5   website_visits         4612 non-null   int64  
 6   time_spent_on_website  4612 non-null   int64  
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object 
 9   print_media_type1      4612 non-null   object 
 10  print_media_type2      4612 non-null   object 
 11  digital_media          4612 non-null   object 
 12  educational_channels   4612 non-null   object 
 13  referral               4612 non-null   object 
 14  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
In [7]:
data.isna().sum()
Out[7]:
0
ID 0
age 0
current_occupation 0
first_interaction 0
profile_completed 0
website_visits 0
time_spent_on_website 0
page_views_per_visit 0
last_activity 0
print_media_type1 0
print_media_type2 0
digital_media 0
educational_channels 0
referral 0
status 0

No null values in the dataset.

In [8]:
data.duplicated().sum()
Out[8]:
np.int64(0)

No duplicates.

In [9]:
data['ID'].nunique()
Out[9]:
4612
In [10]:
data.describe().T
Out[10]:
count mean std min 25% 50% 75% max
age 4612.00000 46.20121 13.16145 18.00000 36.00000 51.00000 57.00000 63.00000
website_visits 4612.00000 3.56678 2.82913 0.00000 2.00000 3.00000 5.00000 30.00000
time_spent_on_website 4612.00000 724.01127 743.82868 0.00000 148.75000 376.00000 1336.75000 2537.00000
page_views_per_visit 4612.00000 3.02613 1.96812 0.00000 2.07775 2.79200 3.75625 18.43400
status 4612.00000 0.29857 0.45768 0.00000 0.00000 0.00000 1.00000 1.00000

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
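The "preparing data for modeling" step typically ends with a stratified train/test split on `status`, so that the roughly 30% conversion rate is preserved in both halves. A minimal sketch on a tiny hypothetical frame (the values below are made up; on the real data, `X` and `y` would come from the encoded dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny hypothetical frame standing in for the encoded lead data
df = pd.DataFrame({
    "age": [25, 40, 55, 33, 47, 29, 51, 38],
    "status": [0, 1, 0, 0, 1, 0, 1, 0],
})
X = df.drop(columns="status")
y = df["status"]

# stratify=y keeps the class balance similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```

Without `stratify`, a small test set can end up with a class mix quite different from the full data, which distorts the evaluation metrics.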
In [11]:
data_clean = data.copy()

# Drop unnecessary columns
data_clean.drop(['ID'], axis=1, inplace=True)


reusable_map = {'Low': 1, 'Medium': 2, 'High': 3}
data_clean['profile_completed'] = data_clean['profile_completed'].map(reusable_map)

# Change column names
data_clean = data_clean.rename(columns={'print_media_type1': 'newspaper', 'print_media_type2': 'magazine'})

# Define channel columns
channel_columns = ['newspaper', 'magazine', 'digital_media', 'educational_channels', 'referral']

# Define categorical columns
categorical_cols = data_clean.select_dtypes(include=['object', 'category']).columns.tolist()
In [12]:
print(categorical_cols)
['current_occupation', 'first_interaction', 'last_activity', 'newspaper', 'magazine', 'digital_media', 'educational_channels', 'referral']
In [13]:
# Define numerical columns
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']

# Outlier detection and treatment (IQR capping)
for col in numeric_cols:
    Q1 = data_clean[col].quantile(0.25)
    Q3 = data_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    data_clean[col] = np.where(data_clean[col] < lower_bound, lower_bound, np.where(data_clean[col] > upper_bound, upper_bound, data_clean[col]))

Let's create a feature to measure the engagement status for each customer.

In [14]:
# Assign weights (example: visits=0.3, time=0.5, pages=0.2)
weights = {'website_visits': 0.3, 'time_spent_on_website': 0.5, 'page_views_per_visit': 0.2}

data_clean['engagement_score'] = (
    data_clean['website_visits'] * weights['website_visits'] +
    data_clean['time_spent_on_website'] * weights['time_spent_on_website'] +
    data_clean['page_views_per_visit'] * weights['page_views_per_visit']
)
In [15]:
data_clean.head()
Out[15]:
age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity newspaper magazine digital_media educational_channels referral status engagement_score
0 57.00000 Unemployed Website 3 7.00000 1639.00000 1.86100 Website Activity Yes No Yes No No 1 821.97220
1 56.00000 Professional Mobile App 2 2.00000 83.00000 0.32000 Website Activity No No No Yes No 0 42.16400
2 52.00000 Professional Website 2 3.00000 330.00000 0.07400 Website Activity No No Yes No No 0 165.91480
3 53.00000 Unemployed Website 3 4.00000 464.00000 2.05700 Website Activity No No No No No 1 233.61140
4 23.00000 Student Website 3 4.00000 600.00000 6.27400 Email Activity No No No No No 0 302.45480

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions

  1. Leads will have different expectations from the outcome of the course and the current occupation may play a key role in getting them to participate in the program. Find out how current occupation affects lead status.
  2. The company's first impression on the customer must have an impact. Do the first channels of interaction have an impact on the lead status?
  3. The company uses multiple modes to interact with prospects. Which way of interaction works best?
  4. The company gets leads from various channels such as print media, digital media, referrals, etc. Which of these channels have the highest lead conversion rate?
  5. People browsing the website or mobile application are generally required to create a profile by sharing their personal data before they can access additional information. Does having more details about a prospect increase the chances of conversion?

EDA¶

  • It is a good idea to explore the data once again after manipulating it.

Criteria

Exploratory Data Analysis

  • Problem definition
  • Univariate analysis
  • Bivariate analysis
  • Provide comments on the visualization such as range of attributes, outliers of various attributes.
  • Provide comments on the distribution of the variables
  • Use appropriate visualizations to identify the patterns and insights
  • Key meaningful observations on individual variables and the relationship between variables

Univariate analysis¶

In [16]:
# Function to plot a boxplot and a histogram along the same scale


def histogram_boxplot(data, feature, kde=False, bins=None):

    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    kde: whether to show the density curve (default False)
    bins: number of bins for the histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,       # Number of rows of the subplot grid = 2
        sharex=True,   # X-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
    )  # Creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # Boxplot; a star indicates the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # Histogram; "auto" lets seaborn choose the bin count
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

    plt.tight_layout()  # Auto-adjusts the positions of the plots
    plt.show()
    print(feature, "Skewness: %f" % data[feature].skew())  # Print the skewness of the column
    print("*" * 20)

Numerical columns analysis¶

In [17]:
for col in numeric_cols:
    histogram_boxplot(data_clean, col, kde=True)
age Skewness: -0.720022
********************
website_visits Skewness: 0.891856
********************
time_spent_on_website Skewness: 0.952928
********************
page_views_per_visit Skewness: 0.204383
********************

Observations on age: Left-skewed, with a mean age of about 46 years.

Observations on website_visits: Right-skewed; the median lead made 3 visits, and 75% made 5 or fewer.

Observations on time_spent_on_website: Right-skewed, with a mean time spent of 724 but a high standard deviation of about 744.

Observations on page_views_per_visit: Multi-modal distribution, with most leads viewing between 2 and 3 pages per visit and smaller spikes around 0 and 6.
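For models sensitive to skew, the right skew in features like `time_spent_on_website` is often tamed with a `log1p` transform. A sketch on synthetic exponential data standing in for that column (not the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for time_spent_on_website
rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=700, size=4000))

print(round(s.skew(), 2))            # strongly right-skewed
print(round(np.log1p(s).skew(), 2))  # skew magnitude shrinks after the transform
```

`log1p` (log of 1 + x) is used instead of a plain log so that zero values, which occur in this data, remain valid.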

Categorical columns analysis¶

In [18]:
# Graph style setup
sns.set(style="whitegrid")

# Graph all the categorical features

for col in categorical_cols:
    # Absolute count and percentage
    value_counts = data_clean[col].value_counts()
    percentage = (value_counts / len(data_clean)) * 100


    # Display table of count and percentage
    print(f"\n--- {col} ---")
    print(pd.DataFrame({'Count': value_counts, '(%)': percentage.round(2)}))

    # Barplot
    plt.figure(figsize=(8, 5))
    sns.barplot(x=value_counts.index, y=value_counts.values, palette="viridis")
    plt.title(f"Distribution of {col}", fontsize=14)
    plt.xlabel(col, fontsize=12)
    plt.ylabel("Frequency", fontsize=12)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
--- current_occupation ---
                    Count      (%)
current_occupation                
Professional         2616 56.72000
Unemployed           1441 31.24000
Student               555 12.03000
--- first_interaction ---
                   Count      (%)
first_interaction                
Website             2542 55.12000
Mobile App          2070 44.88000
--- last_activity ---
                  Count      (%)
last_activity                   
Email Activity     2278 49.39000
Phone Activity     1234 26.76000
Website Activity   1100 23.85000
--- newspaper ---
           Count      (%)
newspaper                
No          4115 89.22000
Yes          497 10.78000
--- magazine ---
          Count      (%)
magazine                
No         4379 94.95000
Yes         233  5.05000
--- digital_media ---
               Count      (%)
digital_media                
No              4085 88.57000
Yes              527 11.43000
--- educational_channels ---
                      Count      (%)
educational_channels                
No                     3907 84.71000
Yes                     705 15.29000
--- referral ---
          Count      (%)
referral                
No         4519 97.98000
Yes          93  2.02000

Observations on current_occupation: Most of our leads are Professionals, followed by Unemployed and Students.

Observations on first_interaction: 55% of first interactions came via the Website, 45% via the Mobile App.

Observations on last_activity: Email is by far the most common last activity (49%), with phone (27%) and website (24%) roughly comparable.

Observations on profile_completed: Most leads have Medium or High profile completion; just 2% have Low completion.

Observations on newspaper: About 11% of leads had seen the newspaper ad.

Observations on magazine (print_media_type2): About 5% of leads had seen the magazine ad.

Observations on digital_media: 11.4% of leads had seen ads on digital platforms.

Observations on educational_channels: 15.3% of leads had heard of ExtraaLearn through educational channels.

Observations on referral: Only 2% of leads come from referrals.

Observations from Univariate Analysis: Engagement happens mostly through email and the website, while relatively few leads are reached through print, digital, or educational channels. Most leads have well-completed profiles, indicating good data quality.

In [19]:
# Calculate conversion % per occupation
conversion = data_clean.groupby('current_occupation')['status'].mean().reset_index()
conversion['conversion_%'] = conversion['status'] * 100
print(conversion)
# Barplot
plt.figure(figsize=(8,5))
sns.barplot(x='current_occupation', y='conversion_%', data=conversion, palette='viridis')
plt.title('Conversion rate per occupation')
plt.ylabel('Conversion rate (%)')
plt.xlabel('Current Occupation')
plt.ylim(0, 100)


plt.show()
  current_occupation  status  conversion_%
0       Professional 0.35512      35.51223
1            Student 0.11712      11.71171
2         Unemployed 0.26579      26.57876

Professionals convert at the highest rate (35.5%), followed by the Unemployed (26.6%); Students convert least often (11.7%).

Bivariate analysis¶

In [20]:
# Correlation check
plt.figure(figsize=(10, 7))
sns.heatmap(
    data_clean[numeric_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

No important correlation between the numeric columns. Page views per visit is somewhat related to the number of website visits.
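A complementary check is the correlation of each numeric feature with the target; since `status` is binary, the Pearson correlation equals the point-biserial correlation. A sketch on a hypothetical mini-sample (on the real data this would be `data_clean[numeric_cols].corrwith(data_clean['status'])`):

```python
import pandas as pd

# Hypothetical mini-sample: time spent vs conversion flag (made-up values)
df = pd.DataFrame({
    "time_spent_on_website": [100, 200, 1500, 1600, 150, 1400],
    "status":                [0, 0, 1, 1, 0, 1],
})
corr = df["time_spent_on_website"].corr(df["status"])
print(round(corr, 2))
```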

In [21]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    # The conversion rate for 'Yes' only applies to the Yes/No flag columns
    if "Yes" in data[predictor].unique():
        conversion = data[data[predictor] == "Yes"][target].mean() * 100
        print(f"% of conversion for {predictor} = 'Yes': {conversion:.2f}%")

    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )

    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()

Leads will have different expectations from the course outcome, and current occupation may play a key role in whether they take the program. Let's analyze it.

In [22]:
stacked_barplot(data=data_clean, predictor="current_occupation", target="status")
status                 0     1   All
current_occupation                  
All                 3235  1377  4612
Professional        1687   929  2616
Unemployed          1058   383  1441
Student              490    65   555
------------------------------------------------------------------------------------------------------------------------

The highest number of converted leads comes from Professionals (929), who also have the highest conversion rate.

Age can be a good factor to differentiate between such leads

In [23]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = data_clean, x = "current_occupation", y = "age")
plt.show()
In [24]:
data_clean.groupby(["current_occupation"])["age"].describe()
Out[24]:
count mean std min 25% 50% 75% max
current_occupation
Professional 2616.00000 49.34748 9.89074 25.00000 42.00000 54.00000 57.00000 60.00000
Student 555.00000 21.14414 2.00111 18.00000 19.00000 21.00000 23.00000 25.00000
Unemployed 1441.00000 50.14018 9.99950 32.00000 42.00000 54.00000 58.00000 63.00000

Students are our youngest group (mean age ~21). Unemployed and Professional leads are older, with mean ages around 50 and higher variance.

The company's first interaction with leads should be compelling and persuasive. Let's see if the channels of the first interaction have an impact on the conversion of leads

In [25]:
stacked_barplot(data=data_clean, predictor="first_interaction", target="status")
status                0     1   All
first_interaction                  
All                3235  1377  4612
Website            1383  1159  2542
Mobile App         1852   218  2070
------------------------------------------------------------------------------------------------------------------------

Low conversion from the Mobile App (~11%); the app experience needs improvement, e.g. increasing its appeal or offering signup discounts.

Leads whose first interaction was via the Website convert at a much higher rate (~46%).

In [26]:
# checking the median value
data_clean.groupby(["status"])["time_spent_on_website"].median()
Out[26]:
time_spent_on_website
status
0 317.00000
1 789.00000

Leads that do convert spend more than double the amount of time browsing the website.

People browsing the website or the mobile app are generally required to create a profile by sharing their personal details before they can access more information. Let's see if the profile completion level has an impact on lead status

In [27]:
stacked_barplot(data=data_clean, predictor='profile_completed', target='status')
status                0     1   All
profile_completed                  
All                3235  1377  4612
3                  1318   946  2264
2                  1818   423  2241
1                    99     8   107
------------------------------------------------------------------------------------------------------------------------

Higher profile completion is associated with a higher conversion rate.
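The pattern can also be quantified directly as a conversion rate per completion level with a `groupby`. A sketch on a hypothetical mini-sample (on the real data, replace `df` with `data_clean`):

```python
import pandas as pd

# Hypothetical mini-sample: completion level (1=Low, 2=Medium, 3=High) vs status
df = pd.DataFrame({
    "profile_completed": [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "status":            [0, 0, 0, 1, 0, 1, 1, 0, 1, 1],
})
rate = df.groupby("profile_completed")["status"].mean().mul(100).round(1)
print(rate)
```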

After a lead shares their information by creating a profile, there may be interactions between the lead and the company to proceed with the process of enrollment. Let's see how the last activity impacts lead conversion status

In [28]:
stacked_barplot(data=data_clean, predictor='last_activity', target='status')
status               0     1   All
last_activity                     
All               3235  1377  4612
Email Activity    1587   691  2278
Website Activity   677   423  1100
Phone Activity     971   263  1234
------------------------------------------------------------------------------------------------------------------------

Leads whose last interaction was Website Activity convert at the highest rate (~38%), though Email Activity accounts for the most conversions in absolute terms (691).

Let's see how advertisement and referrals impact the lead status

In [29]:
stacked_barplot(data=data_clean, predictor='newspaper', target='status')
% of conversion for newspaper = 'Yes': 31.99%
status        0     1   All
newspaper                  
All        3235  1377  4612
No         2897  1218  4115
Yes         338   159   497
------------------------------------------------------------------------------------------------------------------------

Of the 497 leads who saw the newspaper ad, 32% converted to paid customers.

In [30]:
stacked_barplot(data=data_clean, predictor='magazine', target='status')
% of conversion for magazine = 'Yes': 32.19%
status       0     1   All
magazine                  
All       3235  1377  4612
No        3077  1302  4379
Yes        158    75   233
------------------------------------------------------------------------------------------------------------------------

75 of the 233 magazine leads converted, a 32% conversion rate.

In [31]:
stacked_barplot(data=data_clean, predictor='digital_media', target='status')
% of conversion for digital_media = 'Yes': 31.88%
status            0     1   All
digital_media                  
All            3235  1377  4612
No             2876  1209  4085
Yes             359   168   527
------------------------------------------------------------------------------------------------------------------------

168 of the 527 digital media leads converted, again a 32% conversion rate.

In [32]:
stacked_barplot(data=data_clean, predictor='educational_channels', target='status')
% of conversion for educational_channels = 'Yes': 27.94%
status                   0     1   All
educational_channels                  
All                   3235  1377  4612
No                    2727  1180  3907
Yes                    508   197   705
------------------------------------------------------------------------------------------------------------------------

197 of the 705 educational-channel leads converted, a conversion rate of almost 28%.

In [33]:
stacked_barplot(data=data_clean, predictor='referral', target='status')
% of conversion for referral = 'Yes': 67.74%
status       0     1   All
referral                  
All       3235  1377  4612
No        3205  1314  4519
Yes         30    63    93
------------------------------------------------------------------------------------------------------------------------

Very good conversion rate for referrals at 67.7%.

So overall, paid media channels convert at roughly 32% or less (about 1 in 3 leads).

Referrals are the leads most likely to convert to paid customers (2 out of 3).

Let's compare the preferred channels.

In [34]:
# Sum 'Yes' for each channel
channel_counts = {}
for col in channel_columns:
    count_yes = (data_clean[col].astype(str) == 'Yes').sum()
    percentage_yes = (count_yes / len(data_clean)) * 100
    channel_counts[col] = {'Count_yes': count_yes, '(%)': round(percentage_yes, 2)}

# Convert to DataFrame
channel_df = pd.DataFrame(channel_counts).T.reset_index()
channel_df.rename(columns={'index': 'Channel'}, inplace=True)

# Visualization
plt.figure(figsize=(8, 5))
sns.barplot(x='Channel', y='Count_yes', data=channel_df, palette='mako')
plt.title('Best channels', fontsize=14)
plt.xlabel('Channel', fontsize=12)
plt.ylabel('# of Leads', fontsize=12)
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

# Show table
print(channel_df)
                Channel  Count_yes      (%)
0             newspaper  497.00000 10.78000
1              magazine  233.00000  5.05000
2         digital_media  527.00000 11.43000
3  educational_channels  705.00000 15.29000
4              referral   93.00000  2.02000

Educational channels generate the most leads, followed by digital media and newspaper; referrals generate the fewest.
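Lead volume and conversion rate per channel can be combined into a single summary table. A sketch using a small hypothetical stand-in for the data (on the real data, use `data_clean` and the `channel_columns` list defined earlier):

```python
import pandas as pd

# Hypothetical stand-in for the lead data: Yes/No channel flags plus status
df = pd.DataFrame({
    "newspaper": ["Yes", "No", "Yes", "No", "No", "No"],
    "referral":  ["No", "Yes", "No", "No", "Yes", "No"],
    "status":    [1, 1, 0, 0, 1, 0],
})
channels = ["newspaper", "referral"]

# One row per channel: how many leads it reached, and their conversion rate
summary = pd.DataFrame({
    "leads": {c: int((df[c] == "Yes").sum()) for c in channels},
    "conversion_%": {
        c: df.loc[df[c] == "Yes", "status"].mean() * 100 for c in channels
    },
})
print(summary)
```

Such a table makes the volume/quality trade-off explicit: a channel can generate few leads yet convert them well, as referrals do here.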

Outlier Check¶

In [35]:
# outlier detection using boxplot


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_cols):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data_clean[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Data preparation for modeling¶

In [36]:
data_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   age                    4612 non-null   float64
 1   current_occupation     4612 non-null   object 
 2   first_interaction      4612 non-null   object 
 3   profile_completed      4612 non-null   int64  
 4   website_visits         4612 non-null   float64
 5   time_spent_on_website  4612 non-null   float64
 6   page_views_per_visit   4612 non-null   float64
 7   last_activity          4612 non-null   object 
 8   newspaper              4612 non-null   object 
 9   magazine               4612 non-null   object 
 10  digital_media          4612 non-null   object 
 11  educational_channels   4612 non-null   object 
 12  referral               4612 non-null   object 
 13  status                 4612 non-null   int64  
 14  engagement_score       4612 non-null   float64
dtypes: float64(5), int64(2), object(8)
memory usage: 540.6+ KB
In [37]:
data_clean.columns
Out[37]:
Index(['age', 'current_occupation', 'first_interaction', 'profile_completed',
       'website_visits', 'time_spent_on_website', 'page_views_per_visit',
       'last_activity', 'newspaper', 'magazine', 'digital_media',
       'educational_channels', 'referral', 'status', 'engagement_score'],
      dtype='object')
In [38]:
# Numerical variables standardization
scaler = StandardScaler()

scaled_data = data_clean.copy()
if 'engagement_score' not in numeric_cols:  # avoid duplicate entries on re-run
    numeric_cols.append('engagement_score')
scaled_data[numeric_cols] = scaler.fit_transform(scaled_data[numeric_cols])
In [39]:
# One-hot encoding to all categorical columns
data_encoded = pd.get_dummies(scaled_data, columns=categorical_cols, drop_first=True)

# Remove duplicates
data_encoded = data_encoded.loc[:, ~data_encoded.columns.duplicated()]

print(data_encoded.head())
       age  profile_completed  website_visits  time_spent_on_website  \
0  0.82057                  3         1.48374                1.23024   
1  0.74459                  2        -0.60598               -0.86187   
2  0.44064                  2        -0.18804               -0.52976   
3  0.51662                  3         0.22991               -0.34960   
4 -1.76301                  3         0.22991               -0.16674   

   page_views_per_visit  status  engagement_score  current_occupation_Student  \
0              -0.63593       1           1.23228                       False   
1              -1.56538       0          -0.86425                       False   
2              -1.71375       0          -0.53154                       False   
3              -0.51772       1          -0.34954                       False   
4               2.02573       0          -0.16445                        True   

   current_occupation_Unemployed  first_interaction_Website  \
0                           True                       True   
1                          False                      False   
2                          False                       True   
3                           True                       True   
4                          False                       True   

   last_activity_Phone Activity  last_activity_Website Activity  \
0                         False                            True   
1                         False                            True   
2                         False                            True   
3                         False                            True   
4                         False                           False   

   newspaper_Yes  magazine_Yes  digital_media_Yes  educational_channels_Yes  \
0           True         False               True                     False   
1          False         False              False                      True   
2          False         False               True                     False   
3          False         False              False                     False   
4          False         False              False                     False   

   referral_Yes  
0         False  
1         False  
2         False  
3         False  
4         False  
In [40]:
X = data_encoded.drop('status', axis=1)

Y = data_encoded["status"]  # Define the dependent variable (target)


print(X.columns)

# Splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
Index(['age', 'profile_completed', 'website_visits', 'time_spent_on_website',
       'page_views_per_visit', 'engagement_score',
       'current_occupation_Student', 'current_occupation_Unemployed',
       'first_interaction_Website', 'last_activity_Phone Activity',
       'last_activity_Website Activity', 'newspaper_Yes', 'magazine_Yes',
       'digital_media_Yes', 'educational_channels_Yes', 'referral_Yes'],
      dtype='object')
In [41]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3228, 16)
Shape of test set :  (1384, 16)
Percentage of classes in training set:
status
0   0.70415
1   0.29585
Name: proportion, dtype: float64
Percentage of classes in test set:
status
0   0.69509
1   0.30491
Name: proportion, dtype: float64

Percentage of classes in training set:

Meaning: In the training set, 70.4% of the examples have status = 0 (not converted) and 29.6% have status = 1 (converted).

Interpretation: There is a moderate imbalance in the classes, but it's not extreme. The model will see more examples of unconverted leads than converted ones.

Percentage of classes in test set:

Meaning: On the test set, 69.5% of the examples have status = 0 and 30.5% have status = 1.

Interpretation: The proportion of classes on the test set is very similar to that on the training set, which is good for evaluating the model fairly.
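The near-match between the two splits above is luck of the draw; passing stratify=Y to train_test_split makes the class proportions agree by construction. A minimal sketch on synthetic labels (the *_demo names are illustrative, not from this notebook):

```python
# Sketch: stratify= guarantees matching class proportions in both splits.
# Synthetic data for illustration only.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.3).astype(int)  # ~30% positives, like `status`

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

# With stratification, the train and test positive rates agree almost exactly.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

In this notebook that would mean adding `stratify=Y` to the split above; the reported class proportions would then match to within rounding.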

Building a Decision Tree model¶

In [42]:
# Import libraries
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree  # for tree.plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
In [43]:
# Create the model
dt_model = DecisionTreeClassifier(random_state=1)


# Train the model
dt_model.fit(X_train, y_train)

# Predict on test set
y_pred = dt_model.predict(X_test)
print(len(y_pred))
# Accuracy
print("Accuracy on test set:", accuracy_score(y_test, y_pred))


# Get confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Graph CM
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Decision Tree Confusion Matrix')
plt.show()

# Confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
1384
Accuracy on test set: 0.8078034682080925
[Figure: Decision Tree confusion matrix heatmap]
Confusion Matrix:
 [[830 132]
 [134 288]]

The Decision Tree model achieved an accuracy of 0.807 on the test set, indicating good overall performance in distinguishing between converted and unconverted leads.

The confusion matrix shows that the model correctly classified 830 unconverted leads and 288 converted leads. However, 134 leads that did convert were misclassified as unconverted (false negatives), and 132 unconverted leads were misclassified as converted (false positives). The model performs reasonably well, but could be improved to reduce false negatives if the business prioritizes not missing out on conversion opportunities.
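One common way to trade false positives for fewer false negatives, without retraining, is to lower the default 0.5 decision threshold applied to predict_proba. A hedged sketch on synthetic data (stand-ins for the leads set; all names here are illustrative):

```python
# Sketch: lowering the decision threshold cuts false negatives at the
# cost of more false positives. Synthetic data stands in for the leads.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X_demo, y_demo = make_classification(
    n_samples=2000, weights=[0.7, 0.3], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

for threshold in (0.5, 0.3):  # default cutoff vs a more recall-friendly one
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print(f"threshold={threshold}: FN={fn}, FP={fp}")
```

Lowering the threshold can never increase false negatives, since every lead predicted positive at 0.5 is still positive at 0.3.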

In [44]:
print("\nClassification Report:\n", classification_report(y_test, y_pred))


plt.figure(figsize=(20,10))
tree.plot_tree(dt_model, feature_names=X.columns, class_names=['Not Converted', 'Converted'], filled=True, max_depth=3)
plt.title("Decision Tree (first 3 levels)")
plt.show()
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.86      0.86       962
           1       0.69      0.68      0.68       422

    accuracy                           0.81      1384
   macro avg       0.77      0.77      0.77      1384
weighted avg       0.81      0.81      0.81      1384

[Figure: decision tree visualization, first 3 levels]

The Decision Tree model shows an overall accuracy of 81%. Class 0 (non-converted) has stronger metrics, with precision and recall both at 0.86, than Class 1 (converted), where both sit around 0.68. The model is therefore better at identifying leads that do not convert, and has room for improvement in identifying leads that do.

Do we need to prune the tree?¶

Pruning the decision tree (using parameters like max_depth or min_samples_leaf) helps prevent overfitting and improves the model’s ability to generalize to new data. It is a recommended best practice, especially if the unpruned tree shows signs of overfitting.
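Besides capping max_depth (done below via GridSearchCV), scikit-learn also supports minimal cost-complexity pruning through the ccp_alpha parameter. A small sketch on synthetic data, assuming default settings; the *_demo names are illustrative:

```python
# Sketch: cost-complexity pruning (ccp_alpha) as an alternative to
# depth-limiting. Synthetic data for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# cost_complexity_pruning_path returns the effective alphas at which
# subtrees get pruned away; larger alpha -> smaller tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(
    random_state=1, ccp_alpha=path.ccp_alphas[-5]  # a near-maximal alpha
).fit(X_tr, y_tr)

print("full tree leaves:  ", full.get_n_leaves())
print("pruned tree leaves:", pruned.get_n_leaves())
```

The alphas in the path could be cross-validated just like max_depth is below.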

In [45]:
# Define the range of depths to be tested
param_grid = {'max_depth': range(2, 21)}  # Depth 2 - 20

# Configure GridSearchCV
grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Best depth and mean cross-validated accuracy
best_depth = grid_search.best_params_['max_depth']
best_score = grid_search.best_score_

print(f"Best tree depth: {best_depth}")
print(f"Average accuracy (cross-validation): {best_score:.4f}")

# Graphing accuracy vs. depth
results = grid_search.cv_results_
plt.figure(figsize=(8,5))
sns.lineplot(x=param_grid['max_depth'], y=results['mean_test_score'], marker='o')
plt.title('Accuracy vs. Tree Depth (Cross Validation)')
plt.xlabel('Max Depth')
plt.ylabel('Average accuracy (CV)')
plt.axvline(x=best_depth, color='red', linestyle=':', linewidth=2, label=f'Best Depth = {best_depth}')
plt.grid(True)
plt.show()
Best tree depth: 5
Average accuracy (cross-validation): 0.8569
[Figure: CV accuracy vs. tree depth, best depth marked at 5]

The best depth for the decision tree is 5; at that depth the model achieves a mean cross-validated accuracy of 85.69% on the training data.

In [46]:
# Train the tree with the best depth
dt_best = DecisionTreeClassifier(max_depth=best_depth, random_state=1)
dt_best.fit(X_train, y_train)

# Evaluate model on test set
y_pred = dt_best.predict(X_test)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy on test set: 0.8547687861271677
              precision    recall  f1-score   support

           0       0.89      0.91      0.90       962
           1       0.78      0.73      0.75       422

    accuracy                           0.85      1384
   macro avg       0.83      0.82      0.83      1384
weighted avg       0.85      0.85      0.85      1384

By pruning the decision tree to the optimal depth found by cross-validation, accuracy on the test set increased from 80.8% to 85.5%. The F1-score for the converted class improved from 0.68 to 0.75, with precision rising from 0.69 to 0.78 and recall from 0.68 to 0.73. This demonstrates that pruning curbs overfitting and improves the model's generalization.

In [47]:
# Plotting tree with best depth

plt.figure(figsize=(20,10))
tree.plot_tree(dt_best, feature_names=X.columns, class_names=['Not Converted', 'Converted'], filled=True, max_depth=best_depth)
plt.title(f"Decision Tree (first {best_depth} levels)")
plt.show()
[Figure: pruned decision tree visualization]
In [48]:
# Obtain the importance of each variable
feature_importances = pd.DataFrame({
    'Variable': X.columns,
    'Importance': dt_best.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Visualize top 10 important variables
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Variable', data=feature_importances.head(10), palette='viridis')
plt.title(f'Top 10 most important features in the decision tree (max_depth={best_depth})', fontsize=14)
plt.xlabel('Importance')
plt.ylabel('Variable')
plt.tight_layout()
plt.show()

print(feature_importances)
[Figure: top 10 feature importances, decision tree]
                          Variable  Importance
3            time_spent_on_website     0.27123
8        first_interaction_Website     0.26656
1                profile_completed     0.20862
7    current_occupation_Unemployed     0.06739
6       current_occupation_Student     0.05784
9     last_activity_Phone Activity     0.05307
0                              age     0.03922
10  last_activity_Website Activity     0.01848
4             page_views_per_visit     0.01185
2                   website_visits     0.00573
5                 engagement_score     0.00000
11                   newspaper_Yes     0.00000
12                    magazine_Yes     0.00000
13               digital_media_Yes     0.00000
14        educational_channels_Yes     0.00000
15                    referral_Yes     0.00000

Focus on the top features: The Decision Tree model indicates that the most influential factors for lead conversion are the time spent on the website, whether the first interaction was through the website, and having a highly completed profile. These should be prioritized in marketing and lead nurturing strategies.

Business recommendation: Efforts to increase user engagement on the website and encourage users to complete their profiles may significantly improve conversion rates.

On low-importance features: Channels like newspaper, magazine, digital media, educational channels, and referrals did not show predictive power in this model, suggesting they may be less relevant for targeting high-conversion leads in this dataset.

In [49]:
from sklearn.metrics import roc_curve, auc
# Get predicted probabilities for the positive class
y_proba = dt_best.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC Curve
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--')
plt.title('ROC Curve', fontsize=14)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"ROC-AUC Score: {roc_auc:.3f}")
[Figure: ROC curve, AUC = 0.92]
ROC-AUC Score: 0.923

The AUC (Area Under the Curve) quantifies the overall ability of the model to discriminate between positive and negative classes.

  • AUC = 0.92: Excellent performance, the model is very good at distinguishing between classes.

Building a Random Forest model¶

In [50]:
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Model evaluation
print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest Confusion Matrix')
plt.show()


print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
Accuracy on test set: 0.8547687861271677

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.93      0.90       962
           1       0.80      0.69      0.74       422

    accuracy                           0.85      1384
   macro avg       0.84      0.81      0.82      1384
weighted avg       0.85      0.85      0.85      1384

[Figure: Random Forest confusion matrix heatmap]
Confusion Matrix:
 [[890  72]
 [129 293]]

Compared with the unpruned Decision Tree, the Random Forest's confusion matrix shows more true negatives (890 vs 830) and true positives (293 vs 288), and fewer false positives (72 vs 132) and false negatives (129 vs 134). The model is better at predicting both converting and non-converting leads.

Accuracy increased from 80.8% to 85.5%.

In [51]:
# Configure GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Best depth and mean cross-validated accuracy
best_depth = grid_search.best_params_['max_depth']
best_score = grid_search.best_score_

print(f"Best tree depth: {best_depth}")
print(f"Average accuracy (cross-validation): {best_score:.4f}")

# Graphing accuracy vs. depth
results = grid_search.cv_results_
plt.figure(figsize=(8,5))
sns.lineplot(x=param_grid['max_depth'], y=results['mean_test_score'], marker='o')
plt.title('Accuracy vs. Tree Depth (Cross Validation)')
plt.xlabel('Max Depth')
plt.ylabel('Average accuracy (CV)')
plt.axvline(x=best_depth, color='red', linestyle=':', linewidth=2, label=f'Best Depth = {best_depth}')
plt.grid(True)
plt.show()
Best tree depth: 9
Average accuracy (cross-validation): 0.8600
[Figure: CV accuracy vs. max depth, best depth marked at 9]

Cross-validation selects max_depth=9 as the best balance between model complexity and generalization, giving a mean CV accuracy of 86%. Limiting depth this way prevents overfitting and helps the model perform well on new, unseen leads.

In [52]:
# Plotting one tree from the Random Forest

plt.figure(figsize=(20,10))

# Plot the forest's first estimator, truncated to the best depth
tree.plot_tree(rf_model.estimators_[0], feature_names=X.columns, class_names=['Not Converted', 'Converted'], filled=True, max_depth=best_depth)
plt.title(f"Random Forest - first tree (first {best_depth} levels)")
plt.show()
[Figure: single tree from the Random Forest]
In [53]:
# Variable importance
importances = pd.DataFrame({
    'Variable': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Variable', data=importances.head(10), palette='viridis')
plt.title('Top 10 Most Important Features in Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

print(importances.head(10))
[Figure: top 10 feature importances, Random Forest]
                          Variable  Importance
8        first_interaction_Website     0.17869
5                 engagement_score     0.17006
3            time_spent_on_website     0.15378
1                profile_completed     0.10686
4             page_views_per_visit     0.09772
0                              age     0.09667
2                   website_visits     0.05060
9     last_activity_Phone Activity     0.03313
7    current_occupation_Unemployed     0.03019
10  last_activity_Website Activity     0.01952

Random Forest assigns substantial weight to our engineered feature, engagement_score, meaning it contributes significantly to predicting lead conversion.

Leads generated from the website are more likely to convert.

In [54]:
y_proba = rf_model.predict_proba(X_test)[:, 1]

# Calculate ROC curve and AUC (True Positive Rate vs. False Positive Rate)
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)


# Plot ROC Curve
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--')
plt.title('ROC Curve', fontsize=14)
plt.xlabel('False positive rate', fontsize=12)
plt.ylabel('True positive rate', fontsize=12)
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
[Figure: ROC curve, AUC = 0.92]

AUC = 0.92 means the model has good discriminatory power:

If you randomly pick one converted lead and one non-converted lead, the model will rank the converted lead higher 92% of the time.

The curve is close to the top-left corner, which indicates:

  • High TPR (recall): the model correctly identifies most converters.
  • Low FPR: few non-converters are incorrectly classified as converters.
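The "rank the converted lead higher 92% of the time" reading is the Mann-Whitney interpretation of AUC, and it can be checked directly: AUC equals the fraction of (positive, negative) pairs the model scores in the right order, counting ties as half. A tiny hand-rolled example (toy labels and scores, not from this dataset):

```python
# Sketch: AUC as the probability that a random positive is scored above
# a random negative (ties count half). Toy data for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Compare every positive score against every negative score.
pairs = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

# Both values equal 8/9: 8 of the 9 positive/negative pairs are ranked correctly.
print(pairs, roc_auc_score(y_true, scores))
```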

Do we need to prune the tree?¶

In [55]:
# Prune the Random Forest by setting max_depth and other pruning parameters
rf_pruned = RandomForestClassifier(
    n_estimators=100,
    max_depth=best_depth,     # optimal depth found by cross-validation
    min_samples_leaf=1,       # sklearn defaults; raise these to prune further
    min_samples_split=2,
    random_state=1
)
print("Best depth:", best_depth)
rf_pruned.fit(X_train, y_train)

# Evaluate the pruned model on the test set
y_pred = rf_pruned.predict(X_test)

print("Accuracy on test set:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Best depth: 9
Accuracy on test set: 0.8547687861271677

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.92      0.90       962
           1       0.80      0.70      0.75       422

    accuracy                           0.85      1384
   macro avg       0.84      0.81      0.82      1384
weighted avg       0.85      0.85      0.85      1384

Setting max_depth=9 is a form of pruning: you are limiting the maximum depth of each tree to 9, which prevents overfitting and improves generalization.

Actionable Insights and Recommendations¶

Identify which leads are more likely to convert to paid customers¶

The ML models (Decision Tree and Random Forest) achieved roughly 85–86% test accuracy in predicting lead conversion.

Key predictors

  • Time spent on website
  • First interaction via website
  • Profile completion level
  • Page views per visit
  • Age

Action

Use the trained model to score new leads and prioritize those with high predicted conversion probability for sales and follow-up.
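As a sketch of that scoring step, leads can be ranked by predicted conversion probability and the top slice routed to sales first. Synthetic data and a fresh model stand in for the fitted rf_pruned here; lead_id is an illustrative column name:

```python
# Sketch: scoring and ranking leads by predicted conversion probability.
# Synthetic data and a fresh model stand in for the notebook's rf_pruned.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=1)
model = RandomForestClassifier(n_estimators=100, max_depth=9, random_state=1).fit(X_demo, y_demo)

# Rank all leads from most to least likely to convert.
scores = pd.DataFrame({
    "lead_id": range(len(X_demo)),
    "conversion_proba": model.predict_proba(X_demo)[:, 1],
}).sort_values("conversion_proba", ascending=False)

# Hand the top 10% of leads to sales first.
top_decile = scores.head(len(scores) // 10)
print(top_decile.head())
```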

In [56]:
# Segment Leads by propensity (Hot, Warm, Cold)

df = data_clean.copy()

# Define segmentation logic
def segment_lead(row):
    if row['profile_completed'] == 2 and row['website_visits'] >= 3 and row['referral'] == 'Yes':
        return 'Hot'
    elif row['website_visits'] >= 2:
        return 'Warm'
    else:
        return 'Cold'

# Apply segmentation
df['Lead_Segment'] = df.apply(segment_lead, axis=1)

# Summary of segments
segment_summary = df.groupby('Lead_Segment').agg({
    'website_visits': 'mean',
    'time_spent_on_website': 'mean',
    'Lead_Segment': 'count'
}).rename(columns={'Lead_Segment': 'Count'})

print("Lead Segmentation Summary:")
print(segment_summary)
Lead Segmentation Summary:
              website_visits  time_spent_on_website  Count
Lead_Segment                                              
Cold                 0.81270              550.56297    929
Hot                  5.83333             1229.72222     18
Warm                 4.10668              765.49304   3665

Most influential factors for conversion (Random Forest importances)

first_interaction_Website (17.9%) is the top predictor: leads whose first interaction was through the website are markedly more likely to convert.

engagement_score (17.0%), the engineered feature, ranks a close second, confirming that overall engagement carries real conversion signal.

time_spent_on_website (15.4%): the more time a lead spends on the website, the higher the likelihood of conversion.

profile_completed (10.7%): leads with more complete profiles are more likely to convert.

page_views_per_visit (9.8%) and age (9.7%) are also solid predictors: more pages viewed per visit and certain age groups are associated with higher conversion rates.

website_visits (5.1%): more visits to the website modestly increase the chance of conversion.

Last activity also plays a role (last_activity_Phone Activity at 3.3%, Website Activity at 2.0%), as does current_occupation_Unemployed (3.0%), though all carry less weight than the web-engagement features.

Business implications: Focus on digital engagement: The most important features are related to user engagement on the website. This suggests that strategies to increase time spent, page views, and profile completion could significantly improve conversion rates.

Less importance for other channels: Features not listed here (such as magazine, newspaper, digital media, referrals) have negligible or zero importance, indicating they are not strong predictors of conversion in this dataset.

How to use these insights:

  • For marketing: Invest in improving the website experience and encourage users to complete their profiles.
  • For sales: Prioritize leads who spend more time on the website, have higher page views per visit, and whose first interaction was online.
  • For product: Consider features that make it easier for users to explore more pages and complete their profiles.

Objective 2: Find the factors driving the lead conversion process¶

  • Website engagement is critical:

Leads who spend more time and visit more pages on the website are much more likely to convert.

  • Profile completion matters:

Leads with highly completed profiles (75–100%) have a significantly higher conversion rate.

  • First interaction channel:

Leads whose first interaction is via the website convert at a much higher rate than those via the mobile app.

  • Last activity:

Email is the channel through which customers most often interact and re-engage with the platform; keep customers engaged through it.

  • Referral channel:

Although referrals generate fewer leads, their conversion rate is the highest (67.7%).

  • Action:

Focus marketing efforts on increasing website engagement and encouraging profile completion. Improve the mobile app experience to boost conversion and increase visit time. Incentivize referrals, as they have the highest conversion rate.

Example Mobile App improvement:

mobile_app.png

  1. Earning rewards to promote engagement through the app. mobile_app2.png

Example: Referrals Campaign

referrals.png

  • Website interaction is key. Prioritize web interactions, add chatbots, capture visitors with promotions, subscription links, special offers, pop-ups, etc.

Objective 3: Create a profile of leads likely to convert¶

Profile of high-converting leads:

  • Age: Typically older (mean age ~50 for professionals and unemployed; students are younger but convert less).

  • Occupation: Professionals convert most, followed by unemployed; students convert least.

  • First interaction: Website.

  • High profile completion (75–100%).

  • High website activity (more visits, more time spent, more pages viewed).

  • Last activity: Website or phone.

  • Referral source: Highest conversion rate.

  • Action: Target professionals and unemployed individuals with tailored messaging. Encourage all leads to complete their profiles and interact more with the website. Use the model to segment and prioritize leads for sales outreach.

professional.png

  • Woman, professional, 40-50 years old, upskilling from her phone during commute.

education_forall.png

  • Education for all - campaign example.

complete_profile.png

  • Example campaign for Profile Completion

Business Recommendations

Prioritize leads with high website engagement and profile completion for sales follow-up.

Enhance the website and mobile app experience to increase time spent and page views.

Develop campaigns to encourage profile completion (e.g., progress bars, incentives).

Leverage referral programs, as they yield the highest conversion rates.

Use the ML model to automate lead scoring and resource allocation, focusing on leads most likely to convert.

🧠 Modeling Insights

Algorithms Used: Decision Tree & Random Forest. Best Model: Pruned Random Forest (depth=9)

Accuracy: ~85.5%. F1 score (Class 1 - Converted): improved from 0.68 to 0.75

Top Predictive Features (Random Forest):

First interaction via Website (~18%)

Engagement score (~17%)

Time spent on website (~15%)

Profile completion level (~11%)

Page views per visit and age (~10% each)

(Referral source and the print-media channels carry little weight in the model, despite referrals' high raw conversion rate, because such leads are rare.)

Conclusion:

By connecting the model's results to the business objectives, ExtraaLearn can allocate resources more efficiently, improve conversion rates, and grow its customer base by focusing on the most promising leads and the factors that drive their conversion.